Programming Massively Parallel Processors: A Hands-on Approach: The CUDA Execution Model: Host vs. Device

The CUDA execution model turns your computer into a high-performance heterogeneous system. Picture a Grand Director (the Host/CPU) and an Army of thousands (the Device/GPU). The Director handles complex logic and makes decisions, while the Army carries out massive, repetitive tasks simultaneously.

1. The Architectural Divide

The Host is a latency-optimized CPU, designed for complex control flow and serial tasks. The Device, by contrast, is a throughput-optimized GPU containing thousands of simple cores designed to execute the same instruction across enormous data sets at the same time.
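The Device side of this divide can be inspected at runtime. The sketch below is a minimal illustration, assuming a CUDA toolkit and at least one CUDA-capable GPU are installed; `describeDevice` is a hypothetical helper name, while `cudaGetDeviceCount`, `cudaGetDeviceProperties`, and the `cudaDeviceProp` fields are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a few throughput-related properties of one device.
// Returns false if the device cannot be queried (for example, no GPU present).
bool describeDevice(int dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    printf("Device %d: %s\n", dev, prop.name);
    printf("  streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("  max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("  max resident threads/SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    return true;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found.\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) describeDevice(dev);
    return 0;
}
```

The reported values depend on the hardware generation, so the per-block thread limit printed here may differ from the limits of older devices.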

2. The Execution Rhythm

A CUDA program runs as a series of phases. Execution begins on the Host with serial code. When the program reaches a parallel kernel, it launches a grid of threads on the Device. Once all threads of that kernel have finished, execution continues on the Host.
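The phases can be sketched as a small program: serial setup on the Host, a kernel launch on the Device, then serial checking back on the Host. This is a minimal sketch rather than a definitive implementation; `scaleKernel`, `runPhases`, the block size of 256, and the problem size are illustrative assumptions. Note that a kernel launch is asynchronous with respect to the Host, so the sketch relies on the blocking `cudaMemcpy` that follows the launch to wait for the Device to finish.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread doubles (scales) one element.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;  // guard threads past the end of the array
}

// Runs Host phase -> Device phase -> Host phase.
// Returns true if every element was scaled correctly.
bool runPhases(int n) {
    size_t size = n * sizeof(float);

    // Phase 1 (Host): serial setup.
    float *h = (float *)malloc(size);
    if (!h) return false;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = nullptr;
    if (cudaMalloc((void **)&d, size) != cudaSuccess) { free(h); return false; }
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);

    // Phase 2 (Device): launch a grid of threads.
    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    scaleKernel<<<blocks, threads>>>(d, 2.0f, n);

    // This copy waits for the kernel to complete before transferring results.
    cudaError_t err = cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);

    // Phase 3 (Host): serial verification.
    bool ok = (err == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) ok = (h[i] == 2.0f * i);

    cudaFree(d);
    free(h);
    return ok;
}

int main() {
    printf("%s\n", runPhases(1 << 20) ? "ok" : "mismatch");
    return 0;
}
```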

3. Performance Specialization

The model plays to the strengths of both: the CPU manages system resources and complex branching, while the GPU runs SPMD (Single-Program, Multiple-Data) logic to process data elements in parallel.
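To make SPMD concrete, here is a hedged vector-addition sketch: every thread runs the same kernel body, but each derives a different index from `blockIdx`, `blockDim`, and `threadIdx`, and so touches a different data element. The names `vecAddKernel` and `runVecAdd`, the block size, and the test data are assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Single program: the same body runs in every thread.
// Multiple data: each thread computes its own index i and works on element i.
__global__ void vecAddKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical driver: returns true if the Device result matches a + b.
bool runVecAdd(int n) {
    size_t size = n * sizeof(float);
    std::vector<float> a(n), b(n), c(n);
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    float *da = nullptr, *db = nullptr, *dc = nullptr;
    bool ok = cudaMalloc((void **)&da, size) == cudaSuccess &&
              cudaMalloc((void **)&db, size) == cudaSuccess &&
              cudaMalloc((void **)&dc, size) == cudaSuccess;
    if (ok) {
        cudaMemcpy(da, a.data(), size, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b.data(), size, cudaMemcpyHostToDevice);

        int threads = 256;                         // assumed block size
        int blocks = (n + threads - 1) / threads;  // round up to cover n
        vecAddKernel<<<blocks, threads>>>(da, db, dc, n);

        // Blocking copy: waits for the kernel, then fetches the result.
        ok = cudaMemcpy(c.data(), dc, size, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; i < n && ok; ++i) ok = (c[i] == 3.0f * i);
    }
    // cudaFree accepts a null pointer, so this is safe after a failed allocation.
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return ok;
}

int main() {
    // 1000 is deliberately not a multiple of 256, so the bounds check matters.
    printf("%s\n", runVecAdd(1000) ? "ok" : "mismatch");
    return 0;
}
```

Rounding the block count up launches a few surplus threads in the last block; the `if (i < n)` guard keeps them from writing out of bounds.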

QUESTION 1

Which architecture is characterized as being 'throughput-optimized'?

The Host (Intel® CPU)

The Device (NVIDIA® GPU)

The System RAM

The PCIe Bus

QUESTION 2

The reader should complete Part 1 of the MatrixMultiplication() example in Figure 3.6 with declarations of the Nd and Pd pointer variables and their corresponding cudaMalloc() calls. Part 3 in Figure 3.6 can then be completed with the matching deallocation calls. Which snippet does this correctly?

float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);

float Nd, Pd; malloc(&Nd, size); ... free(Nd);

float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;

int Nd, Pd; Nd = new float[size]; ... free(Nd);

QUESTION 3

In the CUDA execution model, where does a program always begin its execution?

On the Device (GPU)

Simultaneously on both

On the Host (CPU)

In the Global Memory

QUESTION 4

What happens when the Host encounters a phase with rich data parallelism?

It speeds up its clock frequency.

It launches a Kernel onto the Device.

It stores the data in the Host Cache.

It converts the code to Python.

QUESTION 5

A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?

The G80 cannot handle 1024 blocks.

The total number of threads exceeds 1 million.

The configuration results in 1024 threads per block, exceeding the 512 hardware limit.

Matrix multiplication is not data parallel.